Abstract. Classic algorithms for sequential pattern discovery return
all frequent sequences present in a database. Since, in general, only a
few of them are interesting from a user's point of view, languages based
on regular expressions (RE) have been proposed to restrict frequent se- 
quences to the ones that satisfy user-specified constraints. Although the 
support of a sequence is computed as the number of data-sequences sat- 
isfying a pattern with respect to the total number of data-sequences in 
the database, once regular expressions come into play, new approaches to 
the concept of support are needed. For example, users may be interested 
in computing the support of the RE as a whole, in addition to the one of 
a particular pattern. Also, when the items are frequently updated, the 
traditional way of counting support in sequential pattern mining may 
lead to incorrect (or at least incomplete) conclusions. For example, if
we are looking for the support of the sequence A.B, where A and B are 
two items such that A was created after B, all sequences in the database 
that were completed before A was created can never produce a match.
Therefore, accounting for them would underestimate the support of the 
sequence A.B. The problem gets more involved if we are interested in 
categorical sequential patterns. In light of the above, in this paper we 
propose to revise the classic notion of support in sequential pattern min- 
ing, introducing the concept of temporal support of regular expressions, 
intuitively defined as the number of sequences satisfying a target pattern, 
out of the total number of sequences that could have possibly matched 
such a pattern, where the pattern is defined as an RE over complex items
(i.e., not only item identifiers, but also attributes and functions). 



1 Introduction 

Traditional sequential pattern mining algorithms are founded on the assumption that
items in databases are static, and that they exist throughout the whole lifespan
of the world modeled by the database. There are many real-world situations
where sequential pattern mining (SPM) is usually applied, and where these as- 
sumptions are not valid any more. In these situations, items are created or 
deleted dynamically. Further, if we are interested in categorical SPM, we need
to deal with complex items, i.e., items described by attributes (or even functions
over attributes). These attributes are also usually subject to change. Consider 
for example SPM in trajectory databases. For many applications, we may be 
interested in trajectory patterns involving restaurants, hotels, gas stations. The 
features that characterize these places may change over time, and many
of them might not even have existed when some of the trajectories under analysis
occurred. This may also occur in the context of the World Wide Web, where 
Web pages are frequently added or deleted. Ntoulas et al. [10] collected weekly
snapshots of 155 web sites over one year. They concluded that
new pages are created at the rate of 8% per week, and only 20% of the pages
available at one instant will be accessible after one year. Thus, there is a high
frequency of creation and deletion of Web pages. Moreover, they found that the
link structure of the Web is more dynamic than the page content.

We introduce the problem through a Web usage mining example. Data Min- 
ing techniques have been applied for discovering interaction patterns of WWW 
users. Typically, this mining is performed over the URLs visited during a session, 
recorded in a Web server log. In this way, the interests and behavioral patterns 
of Web users can be studied. Figure 1 depicts a portion of a (simplified) Web log. 
In classic SPM, the support of a sequence S is defined as the fraction of sessions 
that support S. Thus, all sessions are considered as having the same probability 
to support a given sequence. For example, the support of the sequence CBC, 
counted in the classical way, would be 66%, since CBC is present in two of the 
three sessions. Analogously, the support of the sequence CB would also be 66%. 
We may ask what would have happened if not all these Web pages existed all the time.
The question is: would it be realistic to count support in the usual way? More 
precisely, would it be reasonable to ignore the evolution of the items (URLs) 
across time? We discuss these issues in this paper. 

When a Web page is visited during a session, it is often the case that a
user clicks a nonexistent link or a link that has been removed. Figure 2 shows
how URLs A, B, and C in Figure 1, have evolved, and the time intervals when 
each URL has been available. We can see that URL A was available during the 
interval [1, 8], URL B in intervals [4, 5] and [11, now], and URL C during [3,
6] and [9, now]. (We use the term now to refer to the current time instant.)
We now analyze the support of the sequence CBC. During session s2, we can 
see that URL C did not exist at t=8, when the user clicked URL A. Thus, 
session s2 did not have the possibility of producing a sequence that finishes 
with the URL C. Sessions s1 and s3, instead, support the sequence CBC. Then,
ignoring the evolution of these URLs, the support of sequence CBC would be 
66%, but, if we do not count session s2, we would obtain a support of 100% for 
this sequence. Analogously, if we compute the support of the sequence CB
taking into account the availability of the items during each session, we can see
that s1 and s3 support this sequence, but s2 does not. However, C was available
during session s2, when the user clicked URLs A (t=3) and B (t=4) (actually, it
existed in the interval [3, 6]). Thus, the user could have produced the sequence
CB, although she decided to follow a different path. Session s2 must then be
counted for computing the support of the sequence CB, which would be 66%.

Fig. 1. Web user interaction

Fig. 2. Evolution of three URLs A, B, and C

The example above gives the intuition of the ideas that we discuss and for- 
malize in this paper: the support of a sequence depends on the counting method, 
and when items evolve over time, new definitions of support are needed. Instead 
of considering all sequences in the database in the same way, we propose to ac- 
count for the fact that some of these sequences could have never been produced 
due to the temporal unavailability of some of the items in them. 
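
The counting scheme of this example can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the availability intervals are the ones quoted in the text for Figure 2, but Figure 1 is not reproduced here, so the click timestamps of the three sessions are hypothetical values chosen to agree with the discussion. The function names are also hypothetical, and a session is assumed to support a pattern when the pattern is a (not necessarily contiguous) subsequence of its clicks.

```python
from itertools import combinations

NOW = float("inf")  # stands for the open-ended instant "now"

# Availability intervals quoted in the text for URLs A, B and C.
AVAILABLE = {"A": [(1, 8)], "B": [(4, 5), (11, NOW)], "C": [(3, 6), (9, NOW)]}

# Hypothetical sessions: lists of (time, url) clicks, ordered by time.
SESSIONS = {
    "s1": [(3, "C"), (4, "B"), (5, "C")],
    "s2": [(3, "A"), (4, "B"), (8, "A")],
    "s3": [(11, "C"), (12, "B"), (13, "C")],
}


def is_available(url, t):
    """True if url existed at instant t."""
    return any(start <= t <= end for start, end in AVAILABLE[url])


def supports(clicks, pattern):
    """True if pattern is a subsequence of the URLs clicked in the session."""
    urls = iter(url for _, url in clicks)
    return all(p in urls for p in pattern)


def could_support(clicks, pattern):
    """True if, at some increasing choice of click instants, every item of
    the pattern was available, i.e. the session could have produced it."""
    times = [t for t, _ in clicks]
    return any(
        all(is_available(url, t) for url, t in zip(pattern, chosen))
        for chosen in combinations(times, len(pattern))
    )


def classic_support(pattern):
    """Fraction of all sessions that support the pattern."""
    hits = [s for s, clicks in SESSIONS.items() if supports(clicks, pattern)]
    return len(hits) / len(SESSIONS)


def availability_aware_support(pattern):
    """Fraction of the sessions that could have produced the pattern
    which actually support it."""
    eligible = [s for s, c in SESSIONS.items() if could_support(c, pattern)]
    hits = [s for s in eligible if supports(SESSIONS[s], pattern)]
    return len(hits) / len(eligible) if eligible else 0.0
```

With these assumed sessions, `classic_support("CBC")` counts two sessions out of three, while `availability_aware_support("CBC")` discards s2 and yields 1.0; for `"CB"` all three sessions remain eligible, so both measures give two thirds.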

1.1 Related Work 

Sequential pattern discovery in databases has been studied for a long time. Classic
algorithms [1, 12] return all frequent sequences present in a database. However,
more often than not, only a few of them are interesting from a user's point
of view. Thus, post-processing tasks are required in order to discard uninterest- 
ing sequences. To avoid this drawback, languages based on regular expressions 
(RE) were proposed to restrict frequent sequences to the ones that satisfy user- 
specified constraints. Garofalakis et al. [4, 5] address this problem by pruning the 
candidate patterns obtained during the mining process by adding user-specified 
constraints in the form of regular expressions over items. The algorithm returns 
only the frequent patterns that satisfy these regular expressions. Toroslu and 
Kantarcioglu [16] limit the number of sequences to be found through a parameter
called repetition support. The idea consists in detecting cyclically repeating
patterns. The parameter specifies the minimum number of repetitions of the patterns
within each data-sequence. Thus, the algorithm finds frequent sequences
with at least minimum support and a cyclic repeating pattern. 

Recently, the data mining community started to discuss new notions of support
in SPM that account for changes of the items in the database across time. Although
this problem has already been addressed for Association Rule mining,
where the concept of temporal support has already been introduced [7,8,14], 
this has been overlooked in SPM. To the best of our knowledge, the works we 
comment below are the only ones partially addressing the issue. 

Masseglia et al. [9] and Parthasarathy et al. [11] study the so-called incremental
sequential pattern mining problem. This problem arises when items are
appended to a database. They focus on designing efficient algorithms in order to 
avoid re-scanning the entire database when new items appear. They address the 
addition of items to existing transactions, and the addition of new transactions. 
In the absence of new transactions, the previously computed frequent patterns 
will still be frequent in the new database, and the problem consists in detecting 
the occurrences of new frequent patterns. In the presence of new transactions, 
however, old frequent patterns may or may not be frequent in the incremental 
database. Recently, Huang et al. [6] address the problem of detecting frequent 
patterns valid during a defined period of interest, called POI. For example, if new
items appear, and no new transactions were generated, old frequent sequences 
would still be frequent. 

1.2 Contributions 

In addition to the problem of item evolution and availability commented above, 
we believe that other scenarios have been overlooked so far. For example, when 
regular expressions (from now on, RE) are used to prune non-interesting pat- 
terns, we may ask ourselves if a user would be interested not only in the support 
of a sequence, but in the support of an RE as a whole. Let us analyze a simple
example. The expression (A|B).C is satisfied by sequences like A.C or B.C.
Even though the semantics of this RE suggests that both of them are equally
interesting to the user, if neither of them verifies a minimum support (although 
together they do), they would not be retrieved. The problem gets more involved 
if we are interested in categorical sequential patterns, i.e., patterns like
Science.Sports, where Science and Sports are, for instance, categories of Web pages
in an ontology (in SPIRIT [4,5], the alphabet of the REs is composed only of 
item identifiers). 

In light of the above, we propose to revise, in different ways, the classic notion 
of support for sequential pattern mining. We introduce the concept of temporal 
support of regular expressions, intuitively defined as the number of sequences 
satisfying a target pattern, out of the total number of sequences that could 
have possibly matched such pattern, where the pattern is defined as a RE over 
complex items. We first introduce the data model (Section 2), then we present 
and discuss a theoretical framework for this novel notion of support, and an 
RE-based language (Sections 3 and 4). We conclude in Section 5. 
Fig. 3. An instance of a Table of Items (ToI)



2 Data Model 

Depending on the application domain, the items to be mined can be character- 
ized by different attributes. Throughout the paper we refer to an example where 
each Web page is characterized by the following attributes: (a) catName, which 
represents the name of the category of the item^1; (b) keyword, which summarizes
the page contents; (c) filter, specifying a list of URLs that cannot appear
together with the URL of the item. Finally, ID is a distinguished, mandatory 
attribute, in this case containing the URL that univocally identifies a Web page. 
For each category there are occurrences. In our example, we work with three 
URLs, for simplicity referred to as A, B, and C. We denote set of instances a 
set of occurrences of a collection of categories. The items to be mined are events 
defined over some category occurrence at some instant. These items are stored 
in a so-called Table of Items (ToI). In Figure 3 we show a ToI for our running
example.

2.1 Introducing Temporality 

In many real-world applications, assuming that the values of attributes for a 
category occurrence do not change (or even that a category occurrence spans over 
the complete lifespan of the dataset) may not be realistic. Thus, we introduce
the time dimension into our data model. We do this in the usual way, namely, 
timestamping category occurrences. We assume that the category schema is 
constant across time, i.e., the attributes of a category are the same throughout 
the lifespan of the category. 

Definition 1. [Category Schema] We have a set of attribute names $\mathcal{A}$, and a set
of identifier names $\mathcal{I}$. Each attribute $a \in \mathcal{A}$ is associated with a set of values in
$dom(a)$, and each identifier $ID \in \mathcal{I}$ is associated with a set of values in $dom(ID)$.

^1 Although in our running example we have only one category as an instance of catName,
there are other applications where this is not the case. For example, in a
trajectory database application analyzing tourist itineraries, items could be categorized
as hotels, restaurants, or tourist attractions, to name a few. Each one of
them could be characterized by different attributes. For instance, the kind of food
offered by a restaurant could be an attribute of the category restaurant.



A category schema $S$ is a tuple $(ID, A)$, where $ID \in \mathcal{I}$ is a distinguished
attribute denoted identifier, and $A$ is a set of attributes in $\mathcal{A}$. Without loss of
generality, and for simplicity, in what follows we consider the set $A$ ordered.
Thus, $S$ has the form $[ID, attr_1, \ldots, attr_n]$. □

Example 1. In our running example we have only one category, representing Web 
pages with schema [ID, catName, filter, keyword]. □ 

We consider time as a new sort (domain) in our model. Toman [15]
showed the equivalence between abstract and concrete temporal databases. The
former are point-based structures, independent from the actual implementation 
of the database. The latter contains efficient interval-based encodings of the for- 
mer. The author also showed that there is an efficient translation from abstract 
to concrete temporal databases. Formally, if $T$ is a set, and $<$ a discrete linear
order without endpoints on $T$, the structure $T_P = (T, <)$ is the Point-based
Temporal Domain. The elements in the carrier of $T_P$ model the individual time
instants, and the linear order $<$ models the succession of time. We consider the
set $T$ to be $\mathbb{N}$ (standing for the natural numbers). We can map individual time
instants $t \in \mathbb{N}$ to calendar instants, assuming a reference point and a granularity.
For example, if the reference point is January 1, 1970 00:00 GMT, and
granularity "minute", t=1440 represents 1440 minutes from that date, i.e., Jan- 
uary 2, 1970 00:00 GMT. In what follows we use calendar time, and granularity 
"minute". In temporal databases, the concepts of valid and transaction times 
refer, respectively, to the instants when data is valid in the real world, and when 
data is recorded in the database [13]. We assume valid time support in this paper 
for the categories, and transaction time support for the items (see Definitions 2 
and 6 below). 
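
The mapping between point-based instants and calendar time can be illustrated with a short Python sketch. It assumes the reference point and the granularity of the example above (January 1, 1970 00:00 GMT, "minute"); the names `REFERENCE`, `GRANULE`, `to_calendar` and `to_instant` are hypothetical and introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed reference point and granularity, taken from the example in the text.
REFERENCE = datetime(1970, 1, 1, 0, 0, tzinfo=timezone.utc)
GRANULE = timedelta(minutes=1)


def to_calendar(t):
    """Map a natural-number instant t to a calendar instant."""
    return REFERENCE + t * GRANULE


def to_instant(moment):
    """Map a calendar instant back to the number of elapsed granules."""
    return (moment - REFERENCE) // GRANULE


# t = 1440 minutes after the reference point is January 2, 1970 00:00 GMT.
print(to_calendar(1440).isoformat())
```

Changing `GRANULE` (for instance to `timedelta(hours=1)`) changes the granularity without touching the rest of the mapping.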

Definition 2. [Category Occurrence] Given a category schema $S$, a category
occurrence for $S$ is the tuple $[(ID, id), P, t]$, where $ID$ is the ID attribute of
Definition 1 above, $id \in dom(ID)$, $P$ is the structure $[(attr_1, v_1), \ldots, (attr_n, v_n)]$,
$t$ is a point in the temporal domain $T_P$, and: (a) $attr_i = A(i)$ in $S$ (remember that
$A$ is considered ordered); (b) $v_i \in dom(attr_i)$, $\forall i, i = 1..n$; (c) all the occurrences
of the same category have the same set of attributes, at any given time; (d) at
any instant $t$, the pair $(ID, t)$ is unique for a category occurrence, meaning that
no two occurrences of the same category can have the same value for ID at
the same time; (e) $t$ is the time instant when the information in the category
occurrence is valid.

Definition 3. [Category Instance] A set of occurrences of the same category 
is denoted a category instance. We extend the fourth condition in Definition 2 
to hold for the whole set: no two occurrences of categories in the set can have 
the same value for ID at the same instant $t$ (in other words, the pair $(ID, t)$ is
unique for the whole instance). □ 

Remark 1. In what follows, for clarity, we assume that $attr_1$ stands for ID. Thus,
a category occurrence is the set of pairs $[(attr_1, v_1), \ldots, (attr_n, v_n), t]$. □
Fig. 4. Category occurrences, granularity "minute" (Point-based).



Since point-based and interval-based representations are equivalent, in this 
paper we work with the latter. One of the reasons for this is that in an actual 
implementation, we work with intervals. In our encoding, an event is represented 
by an interval whose endpoints are the same. We need to define this encoding in 
a precise way. The following definition states the condition that a set of tuples 
must satisfy in order to belong to the same group. 

Definition 4. [Interval Encoding] Let $G$ be a time granularity, and $g$ a time
unit for $G$ (e.g., one minute). Given a set of k > category occurrences,
$[(attr_1, v_1), \ldots, (attr_n, v_n), t_1]$, $[(attr_1, v_1), \ldots, (attr_n, v_n), t_2]$, $\ldots$,
$[(attr_1, v_1), \ldots, (attr_n, v_n), t_k]$, if $\forall i, i = 1..k-1$, it holds that $t_{i+1} = t_i + g$, we encode all these
occurrences in a single tuple $[(attr_1, v_1), \ldots, (attr_n, v_n), [t_1, t_k]]$. □

Example 2. Figure 4 shows a set of (point-based) temporal category occurrences
for the Web page category in our running example. Figure 5 shows the corresponding
interval-encoded representation (see below for details).

Encoding a set of tuples requires these tuples to be consecutive over the
granularity selected. Thus, if the granularity is "minute", the tuples [(ID, 'A'),
(keyword, 'computer'), (filter, ''), '12/12/2000 12:31'] and [(ID, 'A'), (keyword,
'computer'), (filter, ''), '12/12/2000 12:33'] cannot be included together in the
same group, since there is a two-minute gap between them. They must be encoded
into two intervals. □
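
The encoding of Definition 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation: instants are assumed to be integers counted in granules, an occurrence is represented as a pair (tuple of attribute values, instant), and the function name `encode_intervals` is hypothetical.

```python
def encode_intervals(occurrences, g=1):
    """Group point-based occurrences with identical attribute values into
    maximal runs of consecutive instants (consecutive means t + g).

    occurrences: iterable of (values, t) pairs, values being a hashable
    tuple of attribute values and t an integer instant.
    Returns a list of (values, t_start, t_end) tuples.
    """
    by_values = {}
    for values, t in occurrences:
        by_values.setdefault(values, []).append(t)

    encoded = []
    for values, instants in by_values.items():
        instants.sort()
        start = prev = instants[0]
        for t in instants[1:]:
            if t != prev + g:          # gap found: close the current interval
                encoded.append((values, start, prev))
                start = t
            prev = t
        encoded.append((values, start, prev))
    return encoded


# The two tuples of Example 2, with 12:31 and 12:33 written as minutes 31 and 33.
page_a = ("A", "computer", "")
print(encode_intervals([(page_a, 31), (page_a, 33)]))
```

Because minute 32 is missing, the two occurrences end up in two single-point intervals, as stated in Example 2; adding an occurrence at minute 32 would merge them into one interval from 31 to 33.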



Definition 5. [Encoded Category Occurrence] Let $C$ be a category instance with
time granularity $G$, and $\mathcal{P}$ a partition of $C$ such that the number of sets $p_i \in \mathcal{P}$ is
minimal. Each set $p_i$ is obtained encoding the occurrences in $C$ as in Definition 4,
i.e., each $p_i$ contains a set of tuples that can be encoded into a single tuple. Thus,
associated to $p_i$ there is a tuple $t_{p_i} = ((ID, id), (attr_1, v_1), \ldots, (attr_n, v_n), t_s, t_e)$,
where (a) $ID, attr_1, \ldots, attr_n$ are the attributes of the occurrences in $p_i$; (b) $id,
v_1, \ldots, v_n$ are the values for the attributes in (a); (c) $t_s$ is the smallest $t$ of the
occurrences in $p_i$; (d) $t_e$ is the largest $t$ of the occurrences in $p_i$. We denote $t_{p_i}$
an encoded category occurrence (ECO) of the set of occurrences in $p_i$. Given an
ECO $e_i$ we denote $Interval(e_i)$ its associated interval $[t_s, t_e]$. □

Fig. 5. Encoded Category Occurrences for the running example (granularity "minute").

Example 3. Figure 5 shows a set of eight ECOs encoding the category instance 
of Figure 4. The Web page with ID='P' (not included in Figure 4) had no filter
when it was created until November 29th, 2007 at 18:50, when the attribute filter 
was updated. Also, the Web page with ID='A' has changed: attribute keyword 
was updated from 'Books' to 'Computers', and also filter was updated. Note 
that there is an interval when this page was not available. After these changes, 
the page was set off-line at 7:30PM on November 29th, 2007. □ 

Adding a time instant to an ECO produces an item.

Definition 6 (Item). Let $e_o = ((ID, v), (attr_1, v_1), \ldots, (attr_n, v_n), t_s, t_e)$ be an
ECO for some category instance. An item $I$ associated to $e_o$ is the set
$((t, v_t), (ID, v), (attr_1, v_1), \ldots, (attr_n, v_n), t_s, t_e)$, such that $v_t \in [t_s, t_e]$ holds. We denote $t$ the
transaction time of the item. □



3 A Theory for Support Count 

In Section 2 we defined the formal data model we use in the remainder of the 
paper to build a theory that can help to provide insight into the notion of 
support. To begin with, over the elements introduced in Definitions 1 through
6, we build a simple language, based on sequences of constraints, that we use 
later to elaborate the concept of support of regular expressions. In short, this 
language expresses paths of constraints. We define the temporal support of these 
paths, denoted sequential expressions (SE). SEs are the cornerstone of our
theory. In the next section, we define a regular language that produces SEs, and 
introduce the notion of temporal support of regular expressions. 



Definition 7. [Terms] There exist no other terms than the following ones: (a)
Constant: a literal enclosed by single quotes; (b) Non-temporal Attribute: an attribute
in the category schema (e.g., filter, url); (c) Temporal Attribute: t, the temporal
attribute of an item (see Definition 6); (d) Function of n arguments: Let $f_n$
be a function symbol, the expression $f_n$(attribute, 'ct1', 'ct2', ..., 'ctn-1'), $n > 1$, is
a function where the first parameter is an attribute (temporal or non-temporal),
and all the other ones are constants. □

Definition 8. [Atoms] Let $C$, $A$ and $F$ be a set of constants, temporal and non-temporal
attributes, and functions, respectively. The expression $term_1 = term_2$ is
an atom, where $term_1 \in A \cup F$, $term_2 \in C$, and '=' is the equality symbol. □

Definition 9. [Formula] We define recursively a formula by the following rules:
(a) An atom is a formula; (b) If $F_1$ and $F_2$ are formulas then $F_1 \wedge F_2$ is a
formula; (c) Nothing else is a formula. □

Definition 10. [Constraint and Formula of a Constraint] A constraint is a formula
enclosed in square brackets. Given a constraint $C = [F]$, we denote $\mathcal{F}(C)$
the formula of $C$. □

Definition 11. [Sequential Expression] A Sequential Expression (SE) of length
$n$ is an ordered list of $n$ sub-expressions $SE_1.SE_2.\ldots.SE_n$, where each $SE_i$ is a
constraint, $\forall i, i = 1..n$. □

Example 4. The sequential expression of length two [ID = 'A' $\wedge$ filter =
'B, C'].[ID = 'X'] is composed of two constraints.

We need to define some operations between intervals. Given two intervals
$I_i = [t_{s,i}, t_{e,i}]$ and $I_j = [t_{s,j}, t_{e,j}]$ we say that $I_i$ follows $I_j$ if $t_{s,i} \geq t_{e,j}$. Saying that
an interval $I_i$ follows another interval $I_j$ is equivalent to saying that $I_i$ is either
after $I_j$ or $I_i$ is met-by $I_j$ in terms of Allen's Interval Algebra [2].

Example 5. In Figure 5 we can see that $Interval(eco_{C3})$ follows $Interval(eco_{A2})$
and $Interval(eco_{C2})$. We can also see that $Interval(eco_{C3})$ does not follow
$Interval(eco_{M1})$.
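
The follows relation is simple enough to state directly in code. The sketch below assumes that an interval is a pair (start, end) of comparable instants and that follows means "starts at or after the end of the other interval", which is the reading suggested by the after/met-by equivalence above. The function names are hypothetical, and the sample intervals are made up, since Figure 5 is not reproduced here.

```python
def follows(i, j):
    """True if interval i starts at or after the end of interval j."""
    return i[0] >= j[1]


def allen_relation(i, j):
    """Name the Allen relation that makes i follow j, or None otherwise."""
    if i[0] > j[1]:
        return "after"
    if i[0] == j[1]:
        return "met-by"
    return None


# Made-up intervals, in minutes.
print(follows((10, 12), (5, 10)), allen_relation((10, 12), (5, 10)))
print(follows((10, 12), (5, 11)), allen_relation((10, 12), (5, 11)))
```

Overlapping intervals never follow each other in either direction, which is what allows two ECOs with overlapping lifespans to appear in either order in a t-ordered list.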

Definition 12. [Satisfiability of a Constraint] Given a constraint $C$ and an ECO
$E$, we say that $E$ satisfies $C$ if one of the following conditions holds: (a) $\mathcal{F}(C)$
is an atom of the form attr = 'ct', where attr is an attribute in any of the
category occurrences in $E$, 'ct' is a constant in $dom(attr)$, and the instantiation
of attr with its value in $E$ equals 'ct'; (b) $\mathcal{F}(C)$ is an atom of the form
$f_n$(attr, 'ct1', 'ct2', ..., 'ctn-1') = 'ct', where attr is an attribute in any of the
category occurrences in $E$, 'ct' is a constant in $dom(attr)$, and the instantiation
of attr in $f_n$ with its value in $E$ makes the equality true; (c) $\mathcal{F}(C)$ is an
atom of the form t = 'ct', where t is a temporal attribute, 'ct' is a temporal
constant in the temporal domain, with granularity $G$, and 'ct' $\in Interval(E)$;
(d) $\mathcal{F}(C)$ is an atom of the form $f_n$(t, 'ct1', 'ct2', ..., 'ctn-1') = 'ct', where t
is a temporal attribute, 'ct' is a temporal constant in the temporal domain with
some granularity $G$, and there exists $t_u \in Interval(E)$ for which the equality is true; (e) $\mathcal{F}(C)$ is a
formula $F_1 \wedge F_2$, and $F_1$ and $F_2$ are satisfied by $E$.
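
A possible reading of Definition 12 as executable code is given below. It is a sketch under several assumptions that are not in the paper: an ECO is a dictionary with an `attrs` mapping and an integer `interval` (minutes since midnight of a single day), a constraint is a list of atoms interpreted as a conjunction, an atom is a pair (term, constant), and a function term is a tuple (callable, attribute, constants...). The `rollup` function is a hypothetical stand-in that only knows the hour level of a Time dimension.

```python
def rollup(t, level, dimension):
    """Hypothetical stand-in for an OLAP rollup: minute -> hour of the day."""
    if (level, dimension) != ("hour", "Time"):
        raise ValueError("only the hour level of Time is sketched here")
    return str(t // 60 % 24)


def atom_holds(eco, atom):
    """Check one atom term = constant against an ECO (cases a-d)."""
    term, constant = atom
    start, end = eco["interval"]
    if term == "t":                               # case (c): t = 'ct'
        return start <= constant <= end
    if isinstance(term, tuple):                   # function terms
        fn, arg, *consts = term
        if arg == "t":                            # case (d): some instant works
            return any(fn(t, *consts) == constant
                       for t in range(start, end + 1))
        return fn(eco["attrs"][arg], *consts) == constant   # case (b)
    return eco["attrs"].get(term) == constant     # case (a): attr = 'ct'


def satisfies(eco, constraint):
    """Case (e): a conjunction holds when every atom holds."""
    return all(atom_holds(eco, atom) for atom in constraint)


# Made-up ECO: page P, no filter, available from 16:45 to 18:50.
eco_p1 = {"attrs": {"ID": "P", "filter": ""}, "interval": (1005, 1130)}
constraint = [((rollup, "t", "hour", "Time"), "18"), ("ID", "P")]
print(satisfies(eco_p1, constraint))
```

Enumerating every instant of the interval in case (d) is the simplest way to express the existential condition; an actual implementation would more likely reason on the interval endpoints.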
Fig. 6. An instance of the Normalized ToI



Definition 13. [Satisfiability of SE] Let $EO = (EO_1, EO_2, \ldots, EO_n)$ be a list of
ECOs such that $\forall i, j$, $i < j \Rightarrow Interval(EO_i)$ does not follow $Interval(EO_j)$.
We denote $EO$ a t-ordered list of ECOs. A sequential expression
$SE = SE_1.SE_2.\ldots.SE_n$ is satisfied by $EO$ if
$EO_i$ satisfies $SE_i$, $\forall i, i = 1..n$. We denote $SL_k(SE)$ the set composed of the $n$
lists of ECOs that satisfy an SE of length $k$. □

Example 6. Let us analyze which ordered lists of ECOs in Figure 5 satisfy the
SE [rollup(t, 'hour', 'Time') = '18'].[keyword = 'Books']. Here, rollup is the usual
rollup function [3], that indicates how a member of an OLAP hierarchy is aggregated.
The meaning is that the equality is true when t is instantiated with
a value that, in the Time dimension, rolls up to the value '18' in the dimension
level hour. For example, [rollup('11/29/2007 18:52', 'hour', 'Time') = '18'].

The first constraint is satisfied by $eco_{A2}$, $eco_{C2}$, $eco_{M1}$, $eco_{P1}$ and $eco_{P2}$.
For all of them, there is a time instant within $Interval(eco_i)$ that verifies the
rollup predicate. The second constraint is satisfied by $eco_{A1}$ and $eco_{C1}$. However,
given the temporal order, the only list of ECOs that satisfies the SE is
$L_1 = (eco_{P1}, eco_{A1})$. In $L_1$, ['11/29/2007 16:45', '11/29/2007 18:50'] (the interval of
$eco_{P1}$) does not follow ['11/29/2007 15:45', '11/29/2007 17:10'] (the interval of
$eco_{A1}$). □
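
The construction of the t-ordered lists that satisfy an SE can be sketched as a filtered Cartesian product. The code below is an illustration only: constraints are modelled as plain Python predicates, intervals are minutes since midnight, and the three ECOs are a made-up subset loosely inspired by Example 6 (the intervals of $eco_{P1}$ and $eco_{A1}$ follow the text, whereas the interval of $eco_{A2}$ and all keywords other than 'Books' are assumptions, since Figure 5 is not reproduced here).

```python
from itertools import product


def follows(i, j):
    """True if interval i starts at or after the end of interval j."""
    return i[0] >= j[1]


def is_t_ordered(ecos):
    """No earlier element of the list may follow a later one."""
    return not any(
        follows(ecos[i]["interval"], ecos[j]["interval"])
        for i in range(len(ecos))
        for j in range(i + 1, len(ecos))
    )


def satisfying_lists(ecos, constraints):
    """All t-ordered lists whose i-th ECO satisfies the i-th constraint."""
    candidates = [[e for e in ecos if holds(e)] for holds in constraints]
    return [combo for combo in product(*candidates) if is_t_ordered(combo)]


def in_hour_18(eco):
    """Some instant of the ECO's interval rolls up to hour 18."""
    start, end = eco["interval"]
    return any(t // 60 == 18 for t in range(start, end + 1))


def about_books(eco):
    return eco["keyword"] == "Books"


ECOS = [
    {"name": "eco_A1", "keyword": "Books", "interval": (945, 1030)},
    {"name": "eco_A2", "keyword": "Computers", "interval": (1100, 1170)},
    {"name": "eco_P1", "keyword": "News", "interval": (1005, 1130)},
]

for combo in satisfying_lists(ECOS, [in_hour_18, about_books]):
    print([e["name"] for e in combo])
```

With this data the pair ($eco_{A2}$, $eco_{A1}$) is discarded because the interval of $eco_{A2}$ follows that of $eco_{A1}$, so only ($eco_{P1}$, $eco_{A1}$) remains.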

Definition 14. [ToI and Normalized ToI] Let $I$ be a finite set of items. A Table
Of Items (ToI) for $I$ is a table with schema $T = (OID, Items)$, where Items
is the name of an attribute whose instances are items, and an instance of $T$
is a finite set of tuples of the form $(O_j, i_k)$ where $i_k \in I$ is an item associated
to the object $O_j$. Moreover, given $(O_j, i_k)$ and $(O_j, i_m)$, two tuples corresponding
to the same object, and $t_k$ and $t_m$ the transaction times of the items, then
$t_k \neq t_m$ holds. A normalized ToI is a database containing a table with schema
$(OID, t, ID)$ (the Normalized ToI), and one table per category, each one with
schema $(ID, attr_1, \ldots, attr_n, t_s, t_e)$. □

Figure 6 shows an instance of a normalized ToI where items are related to the
category instances of Figure 5. There are three sessions (sequences), $Session_1$,
$Session_2$ and $Session_3$, each one with an associated list of items. The three
sessions clicked on URL C, but only $Session_1$ would satisfy the constraint [ID =
'C' $\wedge$ catName = 'Books'] (see Figure 5).



Definition 15. [Temporal Matching of an SE] Let $SE$ be a sequential expression
of length $k$, and a normalized ToI (from now on, nToI), with schema
$(OID, t, ID)$. An object identified by $OID_m$ temporally matches $SE$, if there exist
$k$ tuples in nToI, $(OID_m, t_1, ID_1), (OID_m, t_2, ID_2), \ldots, (OID_m, t_k, ID_k)$, where
for at least one $L_p \in SL_k(SE)$, $L_p = (eco_1, eco_2, \ldots, eco_k)$, $t_i \in Interval(eco_i)$,
$\forall i = 1..k$. □

Example 7. Definition 15 states that if there is a temporally ordered sequence 
of k items such that all of their transaction times fall within the intervals of the 
k ECOs that satisfy the expression, then, we have a temporal match. 

With the category occurrences of Figure 5 and the instance of nToI depicted
in Figure 6, we analyze the sequential expression SE = [ID = 'P'].[filter =
'M']. The ECOs that satisfy the first constraint are $eco_{P1}$ and $eco_{P2}$. The second
constraint is satisfied by $eco_{C3}$. Thus, the lists that satisfy SE are $L_1 =
(eco_{P1}, eco_{C3})$ and $L_2 = (eco_{P2}, eco_{C3})$. The object $Session_1$ temporally matches
SE, since there exist two different tuples in $Session_1$ whose transaction times belong
to $Interval(eco_{P1})$ and $Interval(eco_{C3})$, respectively. With a similar analysis,
$Session_2$ does not match the SE. The ECO $eco_{C3}$ did not exist when the user
in this session clicked the last two URLs. Finally, $Session_3$ temporally matches
SE, because the transaction time of [(t, '11/29/2007 19:32'), (ID, 'M')] belongs to
the interval of $eco_{P2}$, and the transaction time of [(t, '11/29/2007 20:00'), (ID, 'C')]
belongs to the interval of $eco_{C3}$. Intuitively, this means that the user of $Session_3$
could have chosen the URL with ID='P', which existed at the time she chose
the URL with ID='M'. □

From Definition 15, it follows that if a list of ECOs does not satisfy a sequential
expression SE, no object in the nToI can use this list to temporally match
SE. Thus, given that the lists in $SL_k(SE)$ are computed over the category occurrences,
which usually fit in main memory, unnecessary database scans can be
avoided.

Definition 16. [Temporal Satisfiability of a Constraint] Given a constraint $C$
and a normalized ToI, with schema $(OID, t, ID)$, we say that a tuple in nToI
$\mu = (OID_m, t_m, ID_m)$ temporally satisfies $C$ if at least one of the following
conditions holds: (a) $\mathcal{F}(C)$ is an atom of the form t = 'ct', where t is a term
for temporal attributes of items, 'ct' is a temporal constant in the temporal domain
with some granularity, and 'ct' $= t_m$; (b) $\mathcal{F}(C)$ is an atom of the form
$f_n$(t, 'ct1', 'ct2', ..., 'ctn-1') = 'ct', where t is a temporal attribute, 'ct' is a temporal
constant, and $f_n$($t_m$, 'ct1', 'ct2', ..., 'ctn-1') = 'ct'; (c) $\mathcal{F}(C)$ does not
contain a temporal attribute; (d) $\mathcal{F}(C)$ is a formula $F_1 \wedge F_2$ and $F_1$ and $F_2$
are satisfied by $\mu$. □

Definition 17. [Total Matching of a Sequential Expression] Given a sequential
expression $SE = SE_1.SE_2.\ldots.SE_k$ of length $k$, and a normalized ToI with schema
$(OID, t, ID)$, we say that an object identified by $OID_m$ totally matches $SE$, if
there exist $k$ different tuples $\mu_1, \ldots, \mu_k$ in nToI, of the form $\mu_1 = (OID_m, t_1, ID_1)$,
$\mu_2 = (OID_m, t_2, ID_2), \ldots, \mu_k = (OID_m, t_k, ID_k)$, and there is at least one list
$L_p = (eco_1, eco_2, \ldots, eco_k)$, $L_p \in SL_k(SE)$, where the following conditions hold:
(a) $t_i \in Interval(eco_i)$, $\forall i, i = 1..k$; (b) $ID_i$ is the identifier of the encoded
category occurrence $eco_i$, $\forall i, i = 1..k$; (c) $SE_i$ is temporally satisfied by
$\mu_i$, $\forall i, i = 1..k$. We denote each $L_p$ a list of interest for $SE$. □

Property 1. Given a sequential expression $SE = SE_1.SE_2.\ldots.SE_k$ of length $k$, and
a normalized ToI $T$ with schema $(OID, t, ID)$. If an object $OID_j$ in $T$ does not
temporally match SE, then $OID_j$ cannot totally match SE. □

Property 2. Given a sequential expression $SE = SE_1.SE_2.\ldots.SE_k$ of length $k$, and a
normalized ToI $T$ with schema $(OID, t, ID)$, such that there is an object $OID_j$
in $T$ that totally matches SE, then $OID_j$ temporally matches SE. □

Example 8. Object $Session_1$ in Example 7 totally matches SE, using the second
and third tuples, together with list $L_1$. On the other hand, $Session_2$ does not
totally match SE, since it does not temporally match the expression. Finally,
$Session_3$ temporally matches SE, but it does not totally match it, because $L_2$
does not satisfy the second condition in Definition 17. □

Definition 18 (Temporal Support of SE). The temporal support of a sequential
expression SE, denoted $T_s(SE)$, is the quotient between the number of
different objects that totally match SE and the number of different objects that
temporally match SE, if the latter is different from zero. Otherwise $T_s(SE) = 0$. □

Definition 18 formalizes the intuition behind the concept of temporal support, 
namely, counting only the sequences that could have potentially generated a 
matching sequence, given the temporal availability of the category occurrences to 
which an item in a sequence belongs (these sequences are the ones that temporally
match an SE). Classic support count, instead, considers the whole number of
sequences in the database. 

Example 9. In the example above $T_s$([ID = 'P'].[filter = 'M']) = 0.5. The object
$Session_2$ is not considered in the support count because, when the user clicked the
Web page, she did not have the possibility of selecting pages that satisfy the constraint. □
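
Once the sets of temporally and totally matching objects are known, Definition 18 reduces to a guarded quotient. The sketch below assumes those two sets have already been computed; the function name `temporal_support` is hypothetical, and the session identifiers are the ones discussed in Examples 7 and 8.

```python
def temporal_support(totally_matching, temporally_matching):
    """Quotient of Definition 18; 0 when nothing temporally matches."""
    temporally = set(temporally_matching)
    if not temporally:
        return 0.0
    # By Property 2, every totally matching object also temporally matches,
    # so the intersection only guards against inconsistent input.
    totally = set(totally_matching) & temporally
    return len(totally) / len(temporally)


# Example 9: Session_1 totally matches; Session_1 and Session_3 temporally match.
print(temporal_support({"Session_1"}, {"Session_1", "Session_3"}))
```

For comparison, the classic support of the same expression over the three sessions of Figure 6 would be one third, since Session_2 would also be counted in the denominator.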

4 Temporal Support of Regular Expressions 

Having defined the temporal support of a sequential expression, we now move 
on to the general problem, i.e., defining the same concept for an RE. The data 
model defined in Section 2, and the theory developed in Section 3, allow us
to define a language based on RE over constraints, that supports categorical
attributes. We start with a simple example. We wish to restrict the result of
an SPM algorithm to the sequences that match the following expressions: (a)
$SE_1$ = [keyword = 'Games']; (b) $SE_2$ = [keyword = 'Games'].[filter = '']; (c) $SE_3$ =
[keyword = 'Games'].[filter = ''].[filter = '']. For each $SE_i$, $i > 3$, a condition
[filter = ''] is added. We are also interested in computing the temporal support of
these sequential expressions. Instead of computing each support in a separate
fashion, we may want to summarize these sequences in a single RE, namely:
[keyword = 'Games'].([filter = ''])*.

Definition 19 (RE over constraints). A regular expression over the constraints
of Definition 10 is an expression generated by the grammar

    E ← C | E|E | E? | E* | E+ | E.E | E | ε

where C is a constraint, ε is the symbol representing the empty expression,
'|' means disjunction, '.' means concatenation, '?' means "zero or one
occurrence", '+' means "one or more occurrences", and '*' means "zero or more
occurrences". The precedence is the usual one. □

Property 3. Let ℒ be the set of sequential expressions SEi produced by a RE ℛ
generated by the grammar of Definition 19, and let there be a normalized ToI with
schema (OID, t, ID). If an object Oi in the nToI temporally or totally matches
any SE in ℒ, then Oi matches ℛ, temporally or totally, respectively. □

Property 3 follows from observing that [keyword='Games'].[filter="]* could
be written as: [keyword='Games'] | ([keyword='Games'].[filter="]) |
([keyword='Games'].[filter="].[filter="]) |
([keyword='Games'].[filter="].[filter="].[filter="]) | ...
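The unfolding of an RE into sequential expressions can be illustrated with a small Python sketch. It is not the authors' implementation: the class names (`C`, `Eps`, `Alt`, `Cat`, `Star`) and the function `expand` are hypothetical, constraints are treated as opaque labels, and the expansion is cut off at a maximum length so that the result stays finite.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class C:            # a single constraint, kept as an opaque label
    text: str

@dataclass(frozen=True)
class Eps:          # the empty expression
    pass

@dataclass(frozen=True)
class Alt:          # E|E
    left: object
    right: object

@dataclass(frozen=True)
class Cat:          # E.E
    left: object
    right: object

@dataclass(frozen=True)
class Star:         # E*
    inner: object

def opt(e):         # E? is shorthand for E | empty
    return Alt(e, Eps())

def plus(e):        # E+ is shorthand for E.E*
    return Cat(e, Star(e))

def expand(e, max_len):
    """Set of constraint tuples of length <= max_len denoted by e."""
    if isinstance(e, Eps):
        return {()}
    if isinstance(e, C):
        return {(e.text,)} if max_len >= 1 else set()
    if isinstance(e, Alt):
        return expand(e.left, max_len) | expand(e.right, max_len)
    if isinstance(e, Cat):
        return {a + b
                for a in expand(e.left, max_len)
                for b in expand(e.right, max_len - len(a))}
    if isinstance(e, Star):
        # Drop the empty tuple from the body so every round adds length.
        base = {s for s in expand(e.inner, max_len) if s}
        result, frontier = {()}, {()}
        while frontier:
            frontier = {a + b for a in frontier for b in base
                        if len(a) + len(b) <= max_len} - result
            result |= frontier
        return result
    raise TypeError(f"unknown expression node: {e!r}")

KW = "[keyword='Games']"
FL = '[filter="]'
regex = Cat(C(KW), Star(C(FL)))

for se in sorted(expand(regex, 3), key=len):
    print(".".join(se))
```

With a bound of 3 the sketch yields exactly the three expressions SE1, SE2 and SE3 of the running example; raising the bound to 4 adds the fourth disjunct shown above.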

Reasoning along the same lines, since a regular expression over an alphabet
(in our case, constraints) denotes the language that is recognized by a
Deterministic Finite Automaton (DFA), there exists a (possibly infinite) set of
strings over the alphabet that this DFA accepts. Each of these strings (actually,
strings composed of constraints) matches our definition of SE. Then, we extend our
previous definitions of temporal and total matching of an SE to REs, as follows.

Definition 20 (Temporal Matching of a RE). Let ℛ be a regular expression
generated by the grammar of Definition 19, A_ℛ the DFA that accepts ℛ, and
let there be a normalized ToI with schema (OID, t, ID). We say that OIDm
temporally matches ℛ if there exists some n ∈ N such that there is at
least one string of length n accepted by A_ℛ, and OIDm temporally matches this
string. □

Definition 21 (Total Matching of a RE). Let ℛ be a regular expression
generated by the grammar of Definition 19, A_ℛ the DFA that accepts ℛ, and
let there be a normalized ToI with schema (OID, t, ID). We say that OIDm
totally matches ℛ if there exists some n ∈ N such that there is at least one
string of length n accepted by A_ℛ, and OIDm totally matches this string. □

Definition 22 (Temporal Support of a RE). The temporal support of a
regular expression ℛ, denoted Tr(ℛ), is the quotient between the number of
different objects that totally match ℛ and the number of different objects that
temporally match ℛ, if the latter is different from zero. Otherwise Tr(ℛ) = 0. □
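Definitions 20 through 22 can be read as a lifting of the SE-level notions: an object matches the RE as soon as it matches one accepted string. The Python sketch below makes this explicit. It is a minimal illustration under stated assumptions: the names `matches_re` and `temporal_support_re` are hypothetical, the SE-level tests of Section 3 are passed in as predicates rather than implemented, the accepted strings are supplied as a finite collection (for instance, all strings no longer than the longest object), and the sessions and matching facts are invented toy data, not the data of the paper.

```python
from fractions import Fraction

def matches_re(obj, strings, matches_se):
    """Definitions 20 and 21: obj matches the RE if it matches some accepted string."""
    return any(matches_se(obj, se) for se in strings)

def temporal_support_re(objects, strings, temporally_matches, totally_matches):
    """Definition 22: total matches over temporal matches, or 0 if there are none."""
    temporal = {o for o in objects
                if matches_re(o, strings, temporally_matches)}
    # A total match of an SE implies a temporal match of it (Property 2),
    # so total matches only need to be searched among temporal ones.
    total = {o for o in temporal
             if matches_re(o, strings, totally_matches)}
    if not temporal:
        return Fraction(0)
    return Fraction(len(total), len(temporal))

# Toy data (hypothetical): two accepted strings and three objects.
strings = [("a",), ("a", "b")]
objects = ["s1", "s2", "s3"]
temporal_facts = {("s1", ("a",)), ("s2", ("a", "b"))}
total_facts = {("s1", ("a",))}

support = temporal_support_re(
    objects,
    strings,
    lambda o, se: (o, se) in temporal_facts,
    lambda o, se: (o, se) in total_facts,
)
print(support)
```

In the toy data, s3 matches no accepted string and therefore does not enter the denominator, which is the same effect that Definition 18 has at the level of a single SE.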



We use the example above to show how sequential expressions are summarized
using the language of Definition 19. We use the category occurrences
and the nToI shown in Figures 5 and 6, respectively. We first apply Definition 12
in order to check satisfiability of the constraints in the expressions SE1
through SE3. ECOs ecoC1, ecoC2 and ecoC3 from Figure 5 satisfy the constraint
[keyword='Games']. Analogously, the constraint [filter="] is satisfied by ecoA1,
ecoC2, ecoM1 and ecoP1. Next, for each SE, we check satisfiability applying
Definition 13. For SE1 = [keyword='Games'] we obtain SL1(SE) = {L1 = {ecoC2},
L2 = {ecoC3}, L3 = {ecoM1}}. For SE2 = [keyword='Games'].[filter="] we have
SL2(SE) = {L1 = {ecoC2, ecoC2}, L2 = {ecoC2, ecoM1}, L3 = {ecoC2, ecoP1},
L4 = {ecoC3, ecoM1}, L5 = {ecoM1, ecoM1}}. Note that, for example, the list
{ecoC2, ecoA1} does not satisfy SE2 because ecoC2 follows ecoA1. For SE3 =
[keyword='Games'].[filter="].[filter="], we have SL3(SE) = {L1 = {ecoC2, ecoC2,
ecoC2}, L2 = {ecoC2, ecoC2, ecoM1}, L3 = {ecoC2, ecoC2, ecoP1}, L4 = {ecoC2,
ecoM1, ecoM1}, L5 = {ecoC2, ecoP1, ecoC2}, L6 = {ecoC2, ecoP1, ecoM1}, L7 =
{ecoC2, ecoP1, ecoP1}, L8 = {ecoC3, ecoM1, ecoM1}, L9 = {ecoM1, ecoM1, ecoM1}}.
Also here, many lists are discarded. For instance, {ecoM1, ecoM1, ecoA1} does not
satisfy SE3 because ecoM1 follows ecoA1.

Now, we can compute the temporal support of each SE, applying Definitions
15 through 18. For SE1, from the third tuple in Session1, and L2 = {ecoC3},
we conclude that Session1 totally (and hence, temporally) matches SE1. From
the first tuple in Session2 and L1 = {ecoC2}, Session2 totally matches SE1.
From the first or third tuples in Session3, and L2 = {ecoC3}, Session3 totally
matches SE1. Finally, the temporal support of SE1 is 3/3 = 1.

In a similar way, we can conclude that the temporal supports of SE2 and SE3 are,
respectively, 1 and 1/2. Since no session has four tuples, it is not necessary to
analyze a sequential expression of length four, such as
[keyword='Games'].[filter="].[filter="].[filter="].
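The figures above also determine the temporal support of the summarizing RE [keyword='Games'].([filter="])*. The short derivation below spells this out; it assumes that Session1, Session2 and Session3 are the only objects in the nToI, and it writes the RE as \mathcal{R}.

```latex
% SE_1 = [keyword='Games'] is a string accepted by the DFA of \mathcal{R}
% (zero repetitions of the starred constraint).
\begin{align*}
\{\text{objects totally matching } SE_1\}
   &= \{Session_1, Session_2, Session_3\} \\
\Rightarrow\ \{\text{objects totally matching } \mathcal{R}\}
   &= \{Session_1, Session_2, Session_3\}
   && \text{(Definition 21)} \\
\Rightarrow\ \{\text{objects temporally matching } \mathcal{R}\}
   &= \{Session_1, Session_2, Session_3\}
   && \text{(Property 2, Definition 20)} \\
T_r(\mathcal{R}) &= 3/3 = 1
   && \text{(Definition 22)}
\end{align*}
```

Note that this value is not an average of the supports 1, 1 and 1/2 obtained for SE1, SE2 and SE3: an object counts for the RE as soon as it matches one accepted string.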

5 Future Work 

We expect to extend our work in two ways. On the one hand, the theoretical
framework introduced here allows us to consider a more general definition of
support, with different semantics (not only temporal), that may enhance current
data mining tools. On the other hand, we will develop an optimized implementation
of the algorithm that can support massive amounts of data.